(54) Method and apparatus for controlling the number of servers in a multisystem cluster



(57) A method and apparatus for controlling the number of servers in a multisystem cluster. Incoming
work requests are organized into service classes, each of which has a queue serviced by servers across
the cluster. Each service class has defined for it a local performance index for each particular system
of the cluster and a multisystem performance index for the cluster as a whole. Each system selects one
service class as a donor class for donating system resources and another service class as a receiver
class for receiving system resources, based upon how well the service classes are meeting their goals.
Each system then determines the resource bottleneck causing the receiver class to miss its goals. If
the resource bottleneck is the number of servers, each system determines whether and how many servers
should be added to the receiver class, based upon whether the positive effect of adding such servers on
the performance index for the receiver class outweighs the negative effect of adding such servers on
the performance measure for the donor class. If a system determines that servers should be added to
the receiver class, it then determines the system in the cluster to which the servers should be added,
based upon the effect on other work on that system. To make this latter determination, each system
first determines whether another system has enough idle capacity and, if so, lets that system add
servers. If no system has sufficient idle capacity, each system then determines whether the local donor
class will miss its goals if servers are started locally. If not, the servers are started on the local
system. Otherwise, each system determines where the donor class will be hurt the least and acts
accordingly. To ensure the availability of a server capable of processing each of the work requests in
the queue, each system determines whether there is a work request in the queue with an affinity only to
a subset of the cluster that does not have servers for the queue and, if so, starts a server for the
queue on a system in the subset to which the work request has an affinity.
providing an additional server for a queue may enhance [...] resources. Thus, even though [...] such a
server may so degrade the performance of [...] the system that the performance of the system as a whole
deteriorates.

[...] for managing the number [...] independent goals running in the same computer system [...]

[...] J. D. Aman et al., Serial No. 08/828,440, filed March 28, 1997, [...] number of servers on a
particular system in which incoming work requests belonging to [...] are placed in a queue for
processing by one or more servers. The system also has units of work [...] service class that acts as a
donor of system resources. In accordance with the [...], a performance measure is defined for the first
service class as well as for at least one other service class. [...] there is determined not only the
positive effect on the performance measure [...] but also the negative effect on the performance
measure for the other service class [...]

[...] single queue of work requests may be serviced by servers across [...]. Thus, for a given queue,
the decision may involve not only whether to add a server, but [...] where to add the server to
optimize overall sysplex performance.

[...] claim 1 [...]

[...] the service class is a first service class, the cluster having at least one other service class
[...] of the service classes having a performance measure defined for it. The step of determining
whether [...] comprises the steps of: determining a positive effect on the performance [...]
of the cluster only if the queue has no server on the system.

[0012] Further, the invention can be considered as a storage device readable by a machine, tangibly embodying a
program of instructions executable by the machine to perform the method steps of claim 7.
[0013] According to a fourth aspect, the invention provides an apparatus as claimed in claim 9. 

[0014] The present invention relates to a method and apparatus for controlling the number of servers in a cluster of
information handling systems in which incoming work requests belonging to a first service class are placed in a queue
for processing by one or more servers. Some of the incoming work requests may have a requirement to run only on a
subset of the servers in the cluster. Work requests that have such a requirement are said to have an affinity to the
subset of systems in the cluster that they must run on. In accordance with this invention, servers are started on one
or more of the systems in the cluster to process the work requests in the queue. The systems on which to start these
servers are chosen to take advantage of the total capacity of the cluster of systems, to meet the affinity requirements
of the work requests, and to minimize the effect on other work that might be running on the systems in the cluster. The
system on which new servers are started also has units of work assigned to a second service class that acts as a donor
of system resources. In accordance with the invention, a performance measure is defined for the first service class as
well as for the second service class. Before adding servers to the first service class, there is determined not only the
positive effect on the performance measure for the first service class, but also the negative effect on the performance
measure for the second service class. Servers are added to the first service class only if the positive effect on the
performance measure for the first service class outweighs the negative effect on the performance measure for the
second service class.

[0015] The present invention allows system management of the number of servers across a cluster of systems for
each of a plurality of user performance goal classes based on the performance goals of each goal class. Tradeoffs are
made that consider the impact of addition or removal of servers on competing goal classes.

[0016] Preferred embodiments of the present invention will now be described in conjunction with the following
drawings.

[0017] Fig. 1 is a system structure diagram showing particularly a computer system having a controlling operating
system and system resource manager component adapted as described for the present invention.
[0018] Fig. 1A shows the flow of a client work request from the network to a server address space managed by the
workload manager of the present invention.
[0019] Fig. 2 illustrates the state data used to select resource bottlenecks.
[0020] Fig. 3 is a flowchart showing logic flow for the find-bottleneck function.
[0021] Fig. 4 is a flowchart of the steps to assess improving performance by increasing the number of servers.
[0022] Fig. 5 is a sample graph of queue ready user average.
[0023] Fig. 6 is a sample graph of queue delay.
[0024] Fig. 7 shows the procedure for ensuring that there is at least one server somewhere in the cluster that can
run each request on the queue.
[0025] Fig. 8 shows the procedure for determining the best system in the cluster on which to start a server.
[0026] Fig. 9 shows the procedure for finding the system where the impact on the donor work is the smallest.
[0027] As a preliminary to discussing a system incorporating a preferred embodiment of the present invention, some
prefatory remarks about the concept of workload management (upon which the present invention builds) are in order.

[0028] Workload management is a concept whereby units of work (processes, threads, etc.) that are managed by
an operating system are organized into classes (referred to as service classes or goal classes) that are provided system
resources in accordance with how well they are meeting predefined goals. Resources are reassigned from a donor
class to a receiver class if the improvement in performance of the receiver class resulting from such reassignment
exceeds the degradation in performance of the donor class, i.e., there is a net positive effect in performance as
determined by predefined performance criteria. Workload management of this type differs from the run-of-the-mill resource
management performed by most operating systems in that the assignment of resources is determined not only by its
effect on the work units to which the resources are reassigned, but also by its effect on the work units from which they
are taken.

[0029] Workload managers of this general type are disclosed in the following commonly owned patents, pending
patent applications and non-patent publications, incorporated herein by reference:

U.S. Patent 5,504,894 to D. F. Ferguson et al., entitled "Workload Manager for Achieving Transaction Class
Response Time Goals in a Multiprocessing System";

U.S. Patent 5,473,773 to J. D. Aman et al., entitled "Apparatus and Method for Managing a Data Processing System
Workload According to Two or More Distinct Processing Goals";

U.S. Patent 5,537,542 to C. K. Eilert et al., entitled "Apparatus and Method for Managing a Server Workload
According to Client Performance Goals in a Client/Server Data Processing System";

U.S. Patent 5,603,029 to J. D. Aman et al., entitled "System of Assigning Work Requests Based on Classifying
into an Eligible Class where the Criteria Is Goal Oriented and Capacity Information is Available";

U.S. Patent 5,675,739 to C. K. Eilert et al., entitled "Apparatus and Method for Managing a Distributed Data
Processing System Workload According to a Plurality of Distinct Processing Goal Types";

U.S. application Serial No. 08/383,042, filed February 3, 1995, of C. K. Eilert et al., entitled "Multi-System Resource
Capping";

U.S. application Serial No. 08/488,374, filed June 7, 1995, of J. D. Aman et al., entitled "Apparatus and
Accompanying Method for Assigning Session Requests in a Multi-Server Sysplex Environment";

U.S. application Serial No. 08/828,440, filed March 28, 1997, of J. D. Aman et al., entitled "Method and Apparatus
for Controlling the Number of Servers in a Client/Server System";

MVS Planning: Workload Management, IBM publication GC28-1761-00, 1996;

MVS Programming: Workload Management Services, IBM publication GC28-1773-00, 1996.

[0030] Of the patents and applications, U.S. Patents 5,504,894 and 5,473,773 disclose basic workload management
systems; U.S. Patent 5,537,542 discloses a particular application of the workload management system of U.S. Patent
5,473,773 to client/server systems; U.S. Patent 5,675,739 and application 08/383,042 disclose particular applications
of the workload management system of U.S. Patent 5,473,773 to multiple interconnected systems; U.S. Patent
5,603,029 relates to the assignment of work requests in a multisystem complex ("sysplex"); application 08/488,374
relates to the assignment of session requests in such a complex; and, as noted above, application 08/828,440 relates
to the control of the number of servers on a single system of a multisystem complex. The two non-patent publications
describe an implementation of workload management in the IBM® OS/390® (formerly MVS®) operating system.

[0031] Fig. 1 illustrates the environment and the key features of the present invention for an exemplary embodiment
comprising a cluster 90 of interconnected, cooperating computer systems 100, an exemplary two of which are shown.
The environment of the invention is that of a queue 161 of work requests 162 and a pool of servers 163 distributed
across the cluster 90 that service the work requests. The invention allows management of the number of servers 163
based on the performance goal classes of the queued work and the performance goal classes of competing work in
the systems 100. Having a single policy for the cluster 90 of systems 100 helps provide a single-image view of the
distributed workload. Those skilled in the art will recognize that any number of systems 100 and any number of such
queues and groups of servers 163 within a single computer system 100 may be used without departing from the spirit
or scope of the invention. Computer systems 100 execute a distributed workload, and each is controlled by its own
copy of an operating system 101 such as the IBM OS/390 operating system.

[0032] Each copy of the operating system 101 on a respective computer system 100 executes the steps described
in this specification. When the description herein refers to a "local" system 100, it means the system 100 that is executing
the steps being described. The "remote" systems 100 are all the other systems 100 being managed. Note that each
system 100 considers itself local and all other systems 100 remote.

[0033] Except for the enhancements relating to the present invention, system 100 is similar to the ones disclosed in
copending application 08/828,440 and U.S. Patent 5,675,739. As shown in Fig. 1, system 100 is one of a plurality of
interconnected systems 100 that are similarly managed and make up a cluster 90 (also referred to as a system complex,
or sysplex). As taught in U.S. Patent 5,675,739, the performance of various service classes into which units of work
may be classified may be tracked not only for a particular system 100, but for the cluster 90 as a whole. To this end,
and as will be apparent from the description below, means are provided for communicating performance results between
system 100 and other systems 100 in the cluster 90.

[0034] Dispatcher 102 is a component of the operating system 101 that selects the unit of work to be executed next
by the computer. The units of work are the application programs that do the useful work that is the purpose of the
computer system 100. The units of work that are ready to be executed are represented by a chain of control blocks in
the operating system memory called the address space control block (ASCB) queue.
[0035] Work manager 160 is a component outside of the operating system 101 which uses operating system services
to define one or more queues 161 to a workload manager (WLM) 105 and to insert work requests 162 onto these
queues. The work manager 160 maintains the inserted requests 162 in first-in first-out (FIFO) order for selection by
servers 163 of the work manager 160 on any of the systems 100 in the cluster 90. The work manager 160 ensures
that a server only selects requests that have affinity to the system 100 that the server is running on.
[0036] Servers 163 are components of the work manager 160 which are capable of servicing queued work requests
162. When the workload manager 105 starts a server 163 to service requests 162 for a work manager 160's queue
161, the workload manager uses server definitions stored on a shared data facility 140 to start an address space (i.e.,
process) 164. The address space 164 started by the workload manager 105 contains one or more servers (i.e.,
dispatchable units or tasks) 163 which service requests 162 on the particular queue 161 that the address space should
service, as designated by the workload manager.

[0037] Fig. 1A shows the flow of a client work request 162 from a network (not shown) to which system 100 is
connected to a server address space 164 managed by the workload manager 105. A work request 162 is routed to a
particular system 100 in the cluster 90 and received by a work manager 160. Upon receiving the work request 162,
the work manager 160 classifies it to a WLM service class and inserts the work request into a work queue 161. The
queue 161 is shared by all systems 100 in the cluster 90; i.e., queue 161 is a cluster-wide queue. The work request
162 waits in the work queue 161 until there is a server 163 ready to run it.

[0038] A task 163 in a server address space 164 on some system 100 in the cluster 90 that is ready to run a new
work request 162 (either the space has just been started or the task finished running a previous request) calls the work
manager 160 for a new work request. If there is a request 162 on the work queue 161 the address space 164 is serving
and the request has affinity to the system 100 on which the server is running, the work manager 160 passes the request
to the server 163. Otherwise, the work manager 160 suspends the server 163 until a request 162 is available.
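
The request hand-off described above can be sketched as follows. This is only an illustrative model, not the work manager's actual interface: the names `WorkRequest`, `WorkQueue`, `insert` and `select` are hypothetical, and the convention that an empty affinity set means "may run on any system" is an assumption made for the sketch.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, FrozenSet, Optional


@dataclass
class WorkRequest:
    name: str
    # Systems the request may run on; an empty set means "any system" (assumption).
    affinity: FrozenSet[str] = frozenset()


@dataclass
class WorkQueue:
    # Requests are kept in first-in first-out order.
    requests: Deque[WorkRequest] = field(default_factory=deque)

    def insert(self, request: WorkRequest) -> None:
        self.requests.append(request)

    def select(self, system: str) -> Optional[WorkRequest]:
        """Return the oldest request that may run on `system`, else None."""
        for i, request in enumerate(self.requests):
            if not request.affinity or system in request.affinity:
                del self.requests[i]
                return request
        return None  # the caller would suspend the server at this point
```

A server on a given system would call `select` with its own system name; a `None` result corresponds to the case in which the server is suspended until a suitable request arrives.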
[0039] When a work request 162 is received by a work manager 160, it is put on a work queue 161 to wait for a
server 163 to be available to run the request. There is one work queue 161 for each unique combination of work
manager 160, application environment name, and WLM service class of the work request 162. (An application
environment is the environment that a set of similar client work requests 162 needs to execute. In OS/390 terms this maps
to the job control language (JCL) procedure that is used to start the server address space to run the work requests.)
The queuing structures are built dynamically when the first work request 162 for a specific work queue 161 arrives.
The structures are deleted when there has been no activity for a work queue 161 for a predetermined period of time
(e.g., an hour). If an action is taken that can change the WLM service class of the queued work requests 162, like
activating a new WLM policy, the workload manager 105 notifies the work manager 160 of the change and the work
manager 160 rebuilds the work queues 161 to reflect the new WLM service class of each work request 162.
[0040] There is a danger that a work request 162 that has an affinity to a system 100 with no servers 163 might never
run if there are enough servers on other systems 100 to allow the work request's service class to meet its goal. To
avoid this danger the workload manager 105 ensures that there is at least one server 163 somewhere in the cluster
90 that can run each request on the queue 161. Figure 7 shows this logic. This logic is run by the work manager 160
on each system 100 in the cluster 90.

[0041] At step 701 the work manager 160 looks at the first queue 161 it owns. At step 702 the work manager 160
checks to see if there is a server 163 for this queue 161 on the local system 100. If there is a server 163, the work
manager 160 goes on to the next queue 161 (steps 708-709). If the work manager 160 finds a queue 161 with no
servers 163 locally, the work manager next looks at each work request 162 on the queue, beginning with the first work
request (steps 703-706). For each work request 162 the work manager 160 checks if there is a server 163 anywhere
in the cluster 90 that can run the current work request (step 704). If there is a server 163 that can run the current work
request, the work manager 160 goes on to the next request 162 on the queue 161 (step 706). If there is no server 163
that can run the work request 162, the work manager 160 calls the workload manager 105 to start a server 163 (step
707) and then goes on to the next queue 161 (steps 708-709). The work manager 160 continues in a similar manner
until all queues 161 owned by the work manager have been processed (step 710).
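
The Fig. 7 loop described above can be sketched roughly as follows. The data layout is an assumption chosen for brevity: each queue is modeled as a list of affinity sets (an empty set meaning "any system"), `servers` maps a queue name to the systems currently running a server for it, and `start_server` is a hypothetical callback standing in for the call to the workload manager.

```python
from typing import Callable, Dict, FrozenSet, List, Set


def ensure_servers(
    local_system: str,
    queues: Dict[str, List[FrozenSet[str]]],
    servers: Dict[str, Set[str]],
    start_server: Callable[[str, FrozenSet[str]], None],
) -> None:
    """Ask for a server for any queue holding a request that no server can run."""
    for queue_name, requests in queues.items():        # steps 701, 708-710
        running_on = servers.get(queue_name, set())
        if local_system in running_on:                 # step 702
            continue
        for affinity in requests:                      # steps 703-706
            if affinity:
                runnable = bool(affinity & running_on)
            else:
                runnable = bool(running_on)
            if not runnable:                           # step 704
                start_server(queue_name, affinity)     # step 707
                break                                  # go on to the next queue
```

As in the description, at most one start request is issued per queue on each pass, and queues that already have a local server are skipped without examining their requests.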

[0042] To determine the best system 100 on which to start a server for a request when the work manager 160 calls
workload manager 105 (707), workload manager 105 keeps a Service Available Array for each system 100 in the
cluster 90 which indicates the service available at each importance and the unused service for that system. The array
includes an entry for each importance (e.g. importances 0-6) and one for unused service, as depicted below:

Array Element      Array Element Content
array element 1    service available at importance 0
array element 2    service available at importance 1
array element 3    service available at importance 2
array element 4    service available at importance 3
array element 5    service available at importance 4
array element 6    service available at importance 5
array element 7    service available at importance 6
array element 8    unused service

[...] Environment", incorporated herein by reference [...]

[...] problem in other implementations that automatically start spaces. Thus [...]

[...] which the likely cause is a JCL error in the JCL proc for the application [...] to stop accepting
work requests for the application environment until [...]

[...] environment even though it normally only services [...] work queue [...]. When a server address space
164 is no longer needed to support its work queue 161, it is not [...] immediately. Instead, the server address
space 164 waits for a period of time [...] with the same application environment [...] to a work queue 161, the
overhead of starting a new server address space for the [...]. If the server address space 164 is
not needed by another work queue 161 [...]



[...] importance of each goal. The goals 141 are read into each system 100 by [...] the operating system 101
on each of the systems 100 being managed. Each of the goals, which are established and specified by the system
administrator, causes the workload manager 105 [...] to establish a performance class to which individual work
units are assigned. Each [...] class is represented in the memory of the operating systems 101 by a class table
entry 106. The [...] information relating to the performance class are recorded in the [...]. Other information
stored in a class table entry includes the number 107 of servers 163 [...], the [...] of the goal class (an
input value), the multisystem performance index [...] (computed values representing performance measures), the
response [...] (an input value), the execution velocity goal 111 (an input value), sample data 113 (measured
data), [...] history 158 (measured data), [...] and the response time history 126 (measured data).

[0049] Operating system 101 includes a system resource manager (SRM) [...] a multisystem goal-driven
performance controller (MGDPC) 114. [...] generally as described in U.S. Patents 5,473,773 and 5,675,739.
However, MGDPC 114 [...] of servers 163. MGDPC 114 performs the functions of [...] the achievement of goals,
selecting the user performance goal classes that need [...] performance goal classes selected by [...]. The
[...] function is performed periodically based on [...] seconds in the preferred embodiment. The interval at
which [...] is performed is called the MGDPC interval or policy adjustment interval.

[0050] The general manner of operation of MGDPC 114, as described in U.S. Patent 5,675,739, is as follows. At
115, a multisystem performance index [...] and a [...]
of work units associated with the goal class across all the systems 100 being managed. The local performance index
152 represents the performance of work units associated with the goal class on the local system 100. The resulting
performance indexes 151, 152 are recorded in the corresponding class table entry 106. The concept of a performance
index as a method of measuring user performance goal achievement is well known. For example, in the above-cited
U.S. Patent 5,504,894 to Ferguson et al., the performance index is described as the actual response time divided by
the goal response time.
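
The response-time form of the performance index mentioned above is a simple ratio. A minimal sketch follows; the function name and the guard against a non-positive goal are illustrative choices, not part of the cited patent.

```python
def performance_index(actual_response_time: float, goal_response_time: float) -> float:
    """Actual response time divided by goal response time."""
    if goal_response_time <= 0:
        raise ValueError("goal response time must be positive")
    return actual_response_time / goal_response_time
```

With this definition a value above 1 indicates that the class is taking longer than its goal, and a value at or below 1 indicates that the goal is being met.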

[0051] At 116, a user performance goal class is selected to receive a performance improvement in the order of the
relative goal importance 108 and the current value of the performance indexes 151, 152. The selected user performance
goal class is referred to as the receiver. MGDPC 114 first uses the multisystem performance index 151 when choosing
a receiver so that the action it takes has the largest possible impact on causing work units to meet goals across all the
systems 100 being managed. When there is no action to take based on the multisystem performance index 151, the
local performance index 152 is used to select a receiver that will most help the local system 100 meet its goals.
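
One way to picture the two-pass receiver selection is the sketch below. It is a simplification under stated assumptions, not the selection logic of the cited patents: a class is treated as a candidate when its performance index exceeds 1, a smaller importance number is assumed to mean a more important class, and ties within an importance level go to the class with the worst index. `GoalClass` and `select_receiver` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GoalClass:
    name: str
    importance: int        # assumption: smaller number = more important
    multisystem_pi: float  # multisystem performance index 151
    local_pi: float        # local performance index 152


def select_receiver(classes: List[GoalClass]) -> Optional[GoalClass]:
    """Try the multisystem index first, then fall back to the local index."""
    for pi_of in (lambda c: c.multisystem_pi, lambda c: c.local_pi):
        missing = [c for c in classes if pi_of(c) > 1.0]
        if missing:
            # Most important class first; within a level, the worst index wins.
            return min(missing, key=lambda c: (c.importance, -pi_of(c)))
    return None
```

The fallback to the local index only happens when no class is missing its goal on the multisystem index, mirroring the order of preference stated above.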
[0052] After a candidate receiver class has been determined, the controlled variable for that class that constitutes a
performance bottleneck is determined at 117 by using state samples 125, a well-known technique. As described in
U.S. Patent 5,675,739, the controlled variables include such variables as protective processor storage target (affects
paging delay), swap protect time (SPT) target (affects swap delay), multiprogramming level (MPL) target (affects MPL
delay), and dispatch priority (affects CPU delay). In accordance with the present invention, the controlled variables
also include the number of servers 163, which affects queue delay.

[0053] In Fig. 1 the number 107 of servers 163 is shown stored in the class table entry 106, which might be taken
to imply a limitation of one queue 161 per class. However, this is merely a simplification for illustrative purposes; those
skilled in the art will recognize that multiple queues 161 per class can be independently managed simply by changing
the location of the data. The fundamental requirements are that the work requests 162 for a single queue 161 have
only one goal, that each server 163 has equal capability to service requests, and that a server cannot service work on
more than one queue 161 without notification from and/or to the workload manager 105.
[0054] After a candidate performance bottleneck has been identified, the potential changes to the controlled variables
are considered at 118. At 123 a user performance goal class is selected for which a performance decrease can be
made based on the relative goal importance 108 and the current value of the performance indexes 151, 152. The user
performance goal class thus selected is referred to as the donor.

[0055] After a candidate donor class has been selected, the proposed changes are assessed at 124 for net value
relative to the expected changes to the multisystem and local performance indexes 151, 152 for both the receiver and
the donor for each of the controlled variables, including the number 107 of servers 163 and the variables mentioned
above and in U.S. Patent 5,675,739. A proposed change has net value if the result would yield more improvement for
the receiver than harm to the donor relative to the goals. If the proposed change has net value, then the respective
controlled variable is adjusted for both the donor and the receiver.
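
The net-value test can be illustrated with a deliberately reduced sketch. It assumes that a lower performance index is better and compares only the projected index changes of the two classes; the actual assessment also weighs goal importance and both the multisystem and local indexes, which are left out here. The function name and parameters are hypothetical.

```python
def has_net_value(
    receiver_pi: float,
    receiver_projected_pi: float,
    donor_pi: float,
    donor_projected_pi: float,
) -> bool:
    """True if the receiver's projected gain exceeds the donor's projected loss."""
    improvement = receiver_pi - receiver_projected_pi  # index falls as performance improves
    harm = donor_projected_pi - donor_pi               # index rises as performance degrades
    return improvement > harm
```

For example, a receiver projected to move from 1.5 to 1.1 at the cost of a donor moving from 0.7 to 0.8 would pass this simplified test, whereas a receiver gain of 0.05 against a donor loss of 0.2 would not.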
[0056] Each system 100 to be managed is connected to a data transmission mechanism 155 that allows each system
100 to send data records to every other system 100. At 153 a data record describing the recent performance of each
goal class is sent to every other system 100.

[0057] The multisystem goal driven performance controller (MGDPC) function is performed periodically (once every
ten seconds in the preferred embodiment) and is invoked via a timer expiration. The functioning of the MGDPC provides
a feedback loop for the incremental detection and correction of performance problems so as to make the operating
system 101 adaptive and self-tuning.

[0058] At the end of the MGDPC interval a data record describing the performance of each goal class during the
interval is sent to each remote system 100 being managed, as generally described in U.S. Patent 5,675,739. For a
performance goal class having response time goals, this data record contains the goal class name and an array with
entries equivalent to a row of the remote response time history that describes the completions in the goal class over
the last MGDPC interval. For a goal class with velocity goals this data record contains the goal class name, the count
of times work in the goal class was sampled running in the last MGDPC interval, and the count of times work in the
goal class was sampled as running or delayed in the last MGDPC interval. In accordance with the present invention,
each system 100 sends as additional data the Service Available Array for the system 100 sending the data, the number
of servers 163 for each queue 161, and the number of idle servers 163 for each queue 161.

[0059] At 154 a remote data receiver receives performance data from remote systems 100 asynchronously from
MGDPC 114. The received data is placed in remote performance data histories (157, 158) for later processing by the
MGDPC 114.

[0060] Fig. 2 illustrates the state data used by find bottleneck means 117 to select resource bottlenecks to address.
For each delay type, the performance goal class table entry 106 contains the number of samples encountering that
delay type and a flag indicating whether the delay type has already been selected as a bottleneck during the present
invocation of MGDPC 114. In the case of the cross-memory-paging type delay, the class table entry 106 also contains
identifiers of the address spaces that experienced the delays.
[0065] At 507 a check is made to determine whether the paging delay type has the largest number of delay samples
of all the delay types that have not yet been selected. If yes, at 508 the paging-delay-selected flag is set and paging
delay is returned as the next bottleneck to be addressed. There are five types of paging delay. At 507, the type with
the largest number of delay samples is located, and at 508, the flag is set for the particular type and the particular type
is returned. The types of paging delay are: private area, common area, cross memory, virtual input/output (VIO), and
hyperspace, each corresponding to a page delay situation well known in the environment of the preferred embodiment
(OS/390).

[0066] Finally, at 509 a check is made to determine whether the queue delay type has the largest number of delay
samples of all the delay types that have not yet been selected. A class gets one queue delay type sample for each
work request on the queue 161 that is eligible to run on the local system 100. If yes, at 510 the queue-delay-selected
flag is set and queue delay is returned as the next bottleneck to be addressed. Queue delay is not addressed on the
local system 100 if another system 100 in the cluster 90 has started servers 163 for the queue 161 during the last
policy adjustment interval. Queue delay is also not addressed if the candidate receiver class has swapped out ready
work.
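
The selection rule running through these checks (pick the delay type with the most samples among those not yet selected, then mark it as selected) can be sketched generically as below. The function and its dictionary-based state are hypothetical stand-ins for the class table entry of Fig. 2; treating a delay type with zero samples as not a bottleneck is an assumption, and the two exclusions for queue delay described above are not modeled.

```python
from typing import Dict, Optional, Set


def find_bottleneck(samples: Dict[str, int], selected: Set[str]) -> Optional[str]:
    """Return the unselected delay type with the most samples and flag it."""
    candidates = {d: n for d, n in samples.items() if d not in selected and n > 0}
    if not candidates:
        return None
    delay_type = max(candidates, key=candidates.get)
    selected.add(delay_type)  # plays the role of the "...-selected" flag
    return delay_type
```

Calling the function repeatedly with the same `selected` set walks through the delay types in decreasing order of sample count, which is the behavior the per-type flags are there to provide within one MGDPC invocation.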

[0067] The following section describes how the receiver performance goal class performance is improved by
changing a controlled variable to reduce the delay selected by the find bottleneck means 117 and, in particular, how
performance is improved by reducing the queue delay experienced by the receiver. For a shared queue 161 this is a two-step
process. First an assessment is made of adding the servers 163 on the local system 100 including the impact on the
donor work. If there is net value in adding the servers 163, the next step is to determine if the servers should be started
on the local system 100 or they should be started on another system 100 in the cluster 90. If a remote system 100
seems like a better place to start the servers 163, the local system 100 waits to give that system a chance to start the
servers. However, if that system 100 does not start the servers 163, the local system 100 starts them, as described
below in conjunction with Fig. 8.

[0068] Fig. 4 shows the logic flow to assess improving performance by starting additional servers 163. Figs. 4-6
illustrate the steps involved in making the performance index delta projections provided by the fix means 118 to the
net value means 124. At 1401, a new number of servers 163 is selected to be assessed. The number must be large
enough to result in sufficient receiver value (checked at 1405) to make the change worthwhile. The number must not
be so large that the value of additional servers 163 is marginal, for example, not more than the total number of queued
and running work requests 162. The next step is to calculate the additional CPU the additional servers 163 will use;
this is done by multiplying the average CPU used by a work request by the number of additional servers 163 to be added.
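
As a small worked illustration of the arithmetic at 1401, the sketch below computes the upper bound on the number of servers and the additional CPU. The function and parameter names are hypothetical, and the CPU unit is whatever unit the average is measured in.

```python
def max_useful_servers(queued_requests: int, running_requests: int) -> int:
    """Upper bound from the text: no more servers than queued plus running work requests."""
    return queued_requests + running_requests


def additional_cpu(avg_cpu_per_request: float, servers_to_add: int) -> float:
    """Additional CPU = average CPU used by a work request x servers to be added."""
    return avg_cpu_per_request * servers_to_add
```

For example, with 7 queued and 3 running work requests the candidate number of servers would not exceed 10, and adding 4 servers at an average of 2.5 CPU units per work request projects 10.0 additional units.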
[0069] At 1402, the projected number of work requests 162 at the new number of servers 163 is read from the queue
ready user average graph shown in Fig. 5. At 1403, the current and projected queue delays are read from the queue
delay graph shown in Fig. 6. At 1404, the projected local and multisystem performance index deltas are calculated.
These calculations are shown below.

[0070] Fig. 5 illustrates the queue ready user average graph. The queue ready user average graph is used to predict
the demand for servers 163 when assessing a change in the number of servers 163 for a queue 161. The graph can
show the point at which work requests 162 will start backing up. The abscissa (x) value is the number of servers 163
available to the queue 161. The ordinate (y) value is the maximum number of work requests 162 ready to execute.
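
A graph of this kind can be held as a list of (x, y) points and read at an arbitrary x. The sketch below assumes linear interpolation between plotted points and clamping outside the plotted range; the text does not state how intermediate values are read, and the sample points are invented for illustration.

```python
from bisect import bisect_left
from typing import List, Tuple


def read_graph(points: List[Tuple[float, float]], x: float) -> float:
    """Read y at x from points sorted by strictly increasing x."""
    xs = [p[0] for p in points]
    if x <= xs[0]:
        return points[0][1]   # clamp below the plotted range
    if x >= xs[-1]:
        return points[-1][1]  # clamp above the plotted range
    i = bisect_left(xs, x)    # first plotted x that is >= the requested x
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical ready user average graph: (servers available, max ready work requests)
ready_user_graph = [(2, 2.0), (4, 4.0), (8, 5.0)]
projected_ready = read_graph(ready_user_graph, 6)
```

In this invented data the curve flattens after four servers, which is the kind of shape that would indicate where extra servers stop attracting more ready work.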
[0071] Fig. 6 illustrates the queue delay graph. The queue delay graph is used to assess the value of increasing or
decreasing the number of servers 163 for a queue 161. The graph shows how response time may be improved by
increasing the number of queue servers 163 or how response time may be degraded by reducing the number of queue
servers 163. It also implicitly takes into account contention for resources not managed by the workload manager 105 that
might be caused by adding additional servers 163, for example, database lock contention. In such a case the queue
delay on the graph will not decrease as additional servers 163 are added. The abscissa value is the percentage of
ready work requests 162 that have a server 163 available and swapped in across the cluster 90 of systems 100. The
ordinate value is the queue delay per completion.

[0072] Sysplex (i.e., multisystem) performance index (PI) deltas for increases in the number of servers 163 are
calculated as follows. Note that only sysplex performance index deltas are calculated because a queue 161 is a
sysplex-wide resource.

For response time goals:
[0073]

$$\text{projected sysplex PI delta} = \frac{\text{projected queue delay} - \text{current queue delay}}{\text{response time goal}}$$

For velocity goals:
[0074]

$$\text{new sysplex velocity} = \frac{\text{cpuu} + \left(\dfrac{\text{cpuu}}{\text{oldserver}} \times \text{newserver}\right)}{\text{non-idle} + \left(\dfrac{\text{qd}}{\text{qreq}} \times (\text{oldserver} - \text{newserver})\right)}$$

$$\text{sysplex PI delta} = \frac{\text{current sysplex PI} - \text{goal}}{\text{new sysplex velocity}}$$

Where:

cpuu is the sysplex CPU-using samples;
oldserver is the number of servers 163 before the change being assessed is made;
newserver is the number of servers 163 after the change being assessed is made;
non-idle is the total number of sysplex non-idle samples;
qd is the sysplex queue delay samples; and
qreq is the number of work requests 162 on the queue 161.
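
The formulas of paragraphs [0073] and [0074] can be transcribed directly into code. The sketch below keeps the symbol names from the "Where" list; the function names are hypothetical, and reading the velocity denominator as non-idle plus the queue delay term is an assumption about the printed formula.

```python
def response_time_pi_delta(
    projected_queue_delay: float,
    current_queue_delay: float,
    response_time_goal: float,
) -> float:
    """Projected sysplex PI delta for a response time goal."""
    return (projected_queue_delay - current_queue_delay) / response_time_goal


def new_sysplex_velocity(
    cpuu: float, non_idle: float, qd: float, qreq: float,
    oldserver: int, newserver: int,
) -> float:
    """New sysplex velocity after changing from oldserver to newserver servers."""
    numerator = cpuu + (cpuu / oldserver) * newserver
    # Assumption: the two denominator terms are added.
    denominator = non_idle + (qd / qreq) * (oldserver - newserver)
    return numerator / denominator


def velocity_pi_delta(current_sysplex_pi: float, goal: float, new_velocity: float) -> float:
    """Sysplex PI delta for a velocity goal, transcribed as printed."""
    return (current_sysplex_pi - goal) / new_velocity
```

With a response time goal of 2.0 and a queue delay projected to fall from 1.5 to 0.5, the first function gives a delta of -0.5; a negative delta corresponds to a reduced queue delay.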


[0075] Similar calculations are used to calculate performance index deltas for decreases in the number of servers 
163. 

[0076] At 1405, a check is made for sufficient receiver value provided by the additional number of servers 163.
Preferably, this step includes the step of determining whether the new servers 163 would get enough CPU time to
make adding them worthwhile. If there is not sufficient receiver value, control returns to 1401 where a larger number
of servers 163 is selected to be assessed.

[0077] If there is sufficient receiver value, at 1406 select donor means 123 is called to find donors for the storage
needed to start the additional servers 163 on behalf of the receiver performance goal class.

[0078] The controlled variable that is adjusted for the donor class need not necessarily be the number 107 of servers
163 for that class. Any one of several different controlled variables of the donor class, such as MPL slots or protected
processor storage, may be alternatively or additionally adjusted to provide the necessary storage for the additional
servers 163. The manner of assessing the effect on the donor class of adjusting such controlled variables, while forming
no part of the present invention, is described in U.S. Patents 5,537,542 and 5,675,739.

[0079] At 1407, a check is made to ensure that there is net value in taking storage from the donors to increase the
number of servers 163 for the receiver class. As described in U.S. Patent 5,675,739, this may be determined using
one or more of several different criteria, such as whether the donor is projected to meet its goals after the resource
reallocation, whether the receiver is currently missing its goals, whether the receiver is a more important class than
the donor, or whether there is a net gain in the combined performance indexes of the donor and the receiver, i.e.,
whether the positive effect on the performance index for the receiver class of adding servers to the receiver class
outweighs the negative effect on the performance index of the donor class of adding servers to the receiver class. If
there is net value, the next step is to determine if the local system 100 is the best system in the cluster to start the
new servers 163 (1408); otherwise, the receiver goal class queue delay problem cannot be solved (1409).
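
Of the criteria listed for the net value check, the last one (a net gain in the combined performance indexes) reduces to a one-line comparison. The sketch below covers only that criterion and assumes that a lower performance index is better, so that an improvement appears as a negative delta; the function name is hypothetical.

```python
def has_net_pi_gain(receiver_pi_delta: float, donor_pi_delta: float) -> bool:
    """True when the receiver's improvement outweighs the donor's degradation.

    Assumption: lower PI is better, so the receiver's delta is negative when
    it improves and the donor's delta is positive when it degrades.
    """
    return receiver_pi_delta + donor_pi_delta < 0
```

A receiver delta of -0.4 against a donor delta of +0.1 passes the check, while -0.1 against +0.3 does not.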

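[0080] Fig. 8 shows the procedure for determining a target system representing the best system 100 in the cluster
90 on which to start the new servers 163. This procedure is done as part of step 1408 by each system 100 in the cluster
90, once it is determined that there is net value to adding more servers and that one or more servers 163 should be
added to the receiver class. The local system 100 first checks to see if any system 100 in the cluster 90 has enough
idle capacity to support the new servers 163 without impacting other work (step 801). This is done by looking at the
Service Available Array for each system 100 in the cluster 90 and choosing a system 100 that has enough CPU service
available at array element 8 (unused CPU service) to support the new servers 163. If multiple systems 100 have
sufficient unused CPU service, the system 100 with the most unused service is chosen. However, systems 100 with
idle servers 163 are not chosen, because if a system 100 has idle servers when there are queued work requests, it means
many work requests do not have affinity to that system 100.

[0081] The local system 100 then checks to see if a target system 100 was found with enough idle CPU capacity to
start the servers 163 (step 802). If a target system 100 is found and it is the local system 100 (step 803), the local
system 100 starts the servers 163 locally (steps 804-805).

[0082] If at step 803 it is found that another system 100 has the most idle capacity to start the new servers 163, control
passes to step 806, where the local system 100 waits for a policy adjustment interval and then
checks to see if another system 100 has started the servers 163 (step 807). If another system 100 has started the
servers 163, no action is taken locally (step 811). If no other system 100 has started the servers 163, the local system
100 checks to see if it has sufficient idle CPU capacity to support the new servers 163 (step 812). If it has, the local
system 100 starts the servers 163 locally (steps 813-814).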
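
The idle-capacity test can be sketched as a filter followed by a maximum. In the sketch below the `SystemInfo` record and its field names are hypothetical; `unused_cpu_service` stands for the value the text locates at array element 8 of the Service Available Array.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SystemInfo:
    name: str
    unused_cpu_service: float  # element 8 of the Service Available Array
    idle_servers: int          # idle servers 163 for the queue on this system


def pick_idle_target(systems: List[SystemInfo], needed_cpu: float) -> Optional[str]:
    """Return the system with the most unused CPU service that can host the new servers."""
    eligible = [
        s for s in systems
        # Systems with idle servers are skipped: idle servers alongside queued
        # work suggest the queued requests lack affinity to that system.
        if s.idle_servers == 0 and s.unused_cpu_service >= needed_cpu
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s.unused_cpu_service).name


cluster = [
    SystemInfo("SYSA", 50.0, 0),
    SystemInfo("SYSB", 80.0, 2),
    SystemInfo("SYSC", 70.0, 0),
]
```

Here `pick_idle_target(cluster, 40.0)` selects SYSC: SYSB has more unused service but is excluded because it has idle servers.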

[0083] Control passes to step 808 if there were no systems 100 that had sufficient idle CPU capacity to support the
new servers 163 or there was a system 100 that had sufficient idle CPU capacity but that system 100 did not start
servers. One reason that such a system 100 may not start servers 163 is that it has a memory shortage. At this point
it is known that the new servers 163 cannot be started without impacting the donor work. The local system 100 therefore
checks to see if starting the servers 163 locally will cause the donor work to miss its goals (step 808). If the donor work
will not miss its goals, the local system 100 starts the servers 163 locally (steps 817-818). If starting the servers 163 locally
will cause the donor class to miss its goals, the local system 100 then finds the system 100 where the impact on the
donor work is the smallest (step 809).

[0084] Fig. 9 shows the routine for determining at 809 the system 100 where the impact on the donor work is the
smallest. The routine first sends the name of the donor class and the donor's performance index (PI) delta to the other
systems 100 in the cluster 90 (step 901). By exchanging this donor information, each system 100 that is assessing
adding servers 163 for the receiver class can see the impact of adding the servers on all the other systems 100. The
routine then waits one policy interval (10 seconds in the embodiment shown) to allow the other systems 100 to send
their donor information (step 902). The routine then selects the system 100 where the donor class has the lowest
importance (step 903) and returns the selected system 100 to the calling routine to complete step 809 (step 905). If there
is a tie on donor importance (step 904), the routine selects the system 100 where the donor's performance index (PI)
delta is the smallest (step 906) and returns this system 100 to the calling routine (step 907).
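
Once the donor information has been exchanged, the selection in Fig. 9 is a two-key comparison. The sketch below assumes that importance is encoded as an integer in which a larger number means a less important class; that encoding, the `DonorReport` record, and the sample values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DonorReport:
    system: str
    donor_class: str
    importance: int   # assumption: larger number = less important class
    pi_delta: float   # donor's projected performance index delta


def least_donor_impact(reports: List[DonorReport]) -> str:
    """Pick the system whose donor is least important; break ties on the smallest PI delta."""
    lowest = max(r.importance for r in reports)
    tied = [r for r in reports if r.importance == lowest]
    return min(tied, key=lambda r: r.pi_delta).system


reports = [
    DonorReport("SYS1", "BATCH", 3, 0.2),
    DonorReport("SYS2", "TEST", 5, 0.4),
    DonorReport("SYS3", "TEST", 5, 0.1),
]
```

With these reports SYS2 and SYS3 tie on donor importance, and SYS3 is chosen because its donor's PI delta is smaller.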

[0085] Referring again to Fig. 8, after completing step 809 the local system 100 checks to see if the system 100 that was
selected as having the least donor impact is the local system (step 810). If it is, the local system 100 starts the servers
163 locally (steps 817-818). Otherwise, the local system 100 waits a policy interval to allow another system 100 to
start the servers 163 (step 815) and, at the end of this interval, checks to see if another system 100 has started the
servers 163 (step 816). If another system 100 has started the servers 163, the local system 100 takes no action (step
818). If no other system 100 has started the servers 163, the local system 100 starts them locally (steps 817-818).
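
Taken together, the Fig. 8 procedure is a short decision tree evaluated independently on each system. The sketch below is one possible reading of that flow; every name is hypothetical, the waiting and cross-system checks are abstracted into callables supplied by the caller, and the two return strings are illustrative labels.

```python
from typing import Callable, Optional


def decide_server_start(
    local: str,
    idle_target: Optional[str],
    wait_and_check_remote_start: Callable[[], bool],
    local_has_idle_capacity: bool,
    donor_misses_goals_if_local: bool,
    find_least_impact_system: Callable[[], str],
) -> str:
    """Return "start_locally" or "no_action" for the local system."""
    if idle_target is not None:
        if idle_target == local:
            return "start_locally"         # local system has the idle capacity
        if wait_and_check_remote_start():  # give the remote system one interval
            return "no_action"
        if local_has_idle_capacity:
            return "start_locally"
    # From here on, starting servers is known to affect donor work (step 808).
    if not donor_misses_goals_if_local:
        return "start_locally"
    if find_least_impact_system() == local:  # step 809
        return "start_locally"
    if wait_and_check_remote_start():
        return "no_action"
    return "start_locally"                   # nobody else acted
```

The final fallback reflects the rule that the local system starts the servers itself whenever the preferred remote system has not done so by the end of the waiting interval.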
[0086] At 1408, logic is included to temporarily defer requests to start new servers 163 for the queue 161 under
certain circumstances. Concurrent requests to start new servers 163 are limited to avoid unnecessary impact to existing
work. This pacing ensures that the operating system 101 is not flooded with many concurrent requests to start additional
servers 163, which can be disruptive. Detection of faulty information provided by the system
administrator is also implemented, to prevent infinite retry loops if the server definition information is incorrect to the
degree that new servers 163 cannot be successfully started. Once a server 163 is started, logic is also included to
automatically replace a server should it fail unexpectedly. Idle servers 163 with identical server definition information
but serving different queues 161 for the same work manager 160 [...]

has been shown and described, it will be apparent to those skilled in the art that other embodiments beyond the ones
specifically described herein may be made or practised without departing from the spirit of the invention. It will also
be apparent to those skilled in the art that various equivalents may be substituted for elements specifically disclosed
herein. Similarly, changes, combinations and modifications of the presently disclosed embodiments will also be apparent.
For example, multiple queues may be provided for each service class rather than the single queue disclosed
herein. The embodiments disclosed and the details thereof are intended to teach the practice of the invention and are
intended to be illustrative and not limiting.

[0088] A method and apparatus for controlling the number of servers in a multisystem cluster. Incoming work requests
are organized into service classes, each of which has a queue serviced by servers across the cluster. Each service
class has defined for it a local performance index for each particular system of the cluster and a multisystem performance
index for the cluster as a whole. Each system selects one service class as a donor class for donating system
resources and another service class as a receiver class for receiving system resources, based upon how well the
service classes are meeting their goals. Each system then determines the resource bottleneck causing the receiver
class to miss its goals. If the resource bottleneck is the number of servers, each system determines whether and how
many servers should be added to the receiver class, based upon whether the positive effect of adding such servers
on the performance index for the receiver class outweighs the negative effect of adding such servers on the performance
measure for the donor class. If a system determines that servers should be added to the receiver class, it then determines
the system in the cluster to which the servers should be added, based upon the effect on other work on that
system. To make this latter determination, each system first determines whether another system has enough idle
capacity and, if so, lets that system add servers. If no system has sufficient idle capacity, each system then determines
whether the local donor class will miss its goals if servers are started locally. If not, the servers are started on the local
system. Otherwise, each system determines where the donor class will be hurt the least and acts accordingly. To ensure
the availability of a server capable of processing each of the work requests in the queue, each system determines
whether there is a work request in the queue with an affinity only to a subset of the cluster that does not have servers
for the queue and, if so, starts a server for the queue on a system in the subset to which the work request has an affinity.



Claims 



1. In a cluster of information handling systems in which incoming work requests belonging to a service class are
placed in a cluster-wide queue for processing by one or more servers on the systems of the cluster, a method of
controlling the number of such servers, comprising the steps of:

determining whether one or more servers should be added to the service class;

determining a target system in the cluster on which the servers should be added if it is determined that one
or more servers should be added to the service class; and

adding the servers on the target system.

2. The method of claim 1 in which the service class is a first service class, the cluster having at least one other service
class, each of the service classes having a performance measure defined for it.

3. The method of claim 2 in which the step of determining whether servers should be added to the first service class
comprises the steps of:

determining a positive effect on the performance measure for the first service class of adding a predetermined
number of servers to the first service class;

determining a negative effect on the performance measure for one or more other service classes of adding
the predetermined number of servers to the first service class; and

determining whether the positive effect on the performance measure for the first service class outweighs the
negative effect on the performance measure for the one or more other service classes.



The method of claim 1 in which the step of determining a target system in the cluster on which the servers should
be added comprises the steps of:

determining whether any system in the cluster has sufficient idle capacity to add the one or more additional
servers; and [...]

[...] apparatus for controlling the number of such servers, comprising:

means for determining whether one or more servers should be added to the service class;

means for adding the servers on the target system.

[...] an affinity only to a subset of the cluster that does not have servers for the queue.

[...] the availability of a server capable of processing each of the work requests in the queue, comprising: [...]

[...] a subset of the cluster that does not have servers for the queue.